Reproducible workflows

Version Control and Computational Notebooks

John Little

Duke University Libraries

Center for Data & Visualization Sciences

2024-02-05

Notebooks

Reproducibility

  • Do everything with code!

    • Helps reduce repetition errors
    • Helps avoid copy/paste barriers
    • Orchestrate workflows
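A minimal sketch of "do everything with code": one function applied to every data set replaces a copy/paste step per file. The data, the column name `temp`, and the function name `summarize` are hypothetical, chosen only for illustration.

```python
import csv
import io


def summarize(csv_text: str, column: str) -> float:
    """Return the mean of one numeric column in a block of CSV text."""
    rows = csv.DictReader(io.StringIO(csv_text))
    values = [float(row[column]) for row in rows]
    return sum(values) / len(values)


# Hypothetical in-memory data sets standing in for files on disk.
datasets = {
    "site_a": "temp\n10\n20\n30\n",
    "site_b": "temp\n5\n15\n",
}

# The same step runs on every data set, so no step is retyped or pasted.
means = {name: summarize(text, "temp") for name, text in datasets.items()}
print(means)
```

Adding a third data set means adding one dictionary entry; the analysis step itself does not change.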

Computational Notebooks

  • Authoring environment

    • Code chunks interspersed with natural language
    • aka Literate Coding
  • Easy to read and compose

  • Graceful degradation

Reports and Expressions

Using this method, reports (expressions such as slides, PDFs, dashboards, ebooks, etc.) are rendered as part of the code execution.
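As a rough sketch, rendering one source document into several expressions can be scripted. The example below only builds `quarto render` command lines; the file name `report.qmd` and the chosen formats are assumptions, and nothing is executed.

```python
import shlex


def render_command(source: str, fmt: str) -> list[str]:
    """Build a `quarto render` command for one output format."""
    return ["quarto", "render", source, "--to", fmt]


# Hypothetical source file and target formats.
formats = ["html", "pdf", "revealjs"]
commands = [render_command("report.qmd", fmt) for fmt in formats]

for command in commands:
    # Print the command as it would be typed in a terminal.
    print(shlex.join(command))
```

To actually render, each list could be passed to `subprocess.run(command, check=True)` on a machine where Quarto is installed.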


Interactivity and web applications

  • Shiny
  • Flask
  • WebR
  • Plotly Dash
  • ObservableJS

Quarto

  • Open source / free / portable
  • Show explanatory text and code chunks
  • Show YAML format option
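For readers without the live demo, here is a hedged sketch of what such a document looks like: a YAML header that sets the output format, a line of explanatory text, and one code chunk. The title, the format value, and the helper name `minimal_qmd` are illustrative assumptions.

```python
# Three backticks, built at runtime so this example can sit inside a fenced block.
FENCE = "`" * 3


def minimal_qmd(title: str, fmt: str) -> str:
    """Assemble a small Quarto document: YAML header, prose, one code chunk."""
    lines = [
        "---",
        f"title: {title}",
        f"format: {fmt}",  # the YAML "format" option selects the rendered output
        "---",
        "",
        "Explanatory text written in plain Markdown.",
        "",
        FENCE + "{python}",
        "print(1 + 1)",
        FENCE,
        "",
    ]
    return "\n".join(lines)


doc = minimal_qmd("Demo report", "html")
print(doc)
```

Saving the returned text as a `.qmd` file gives a document that Quarto can render; changing only the `format` line changes the kind of output.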

Quarto Notebook in RStudio

Jupyter Notebooks

Quarto

  • A scientific publishing system
  • R, Python, ObservableJS
  • Compose with standard text editors, or basic IDEs
    • IDEs: RStudio, Jupyter, VSCode

Rendered Outputs

  • Artifacts that document a body of work
  • Are reproducible and modifiable when data or techniques change
  • Easy to update natural language explanations and re-render outputs
  • Schedule emails based on report parameters
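One way to picture re-rendering when inputs change is a parameterized report. The sketch below builds one `quarto render` command per parameter value using the `-P name:value` option; the file name `report.qmd`, the parameter name `region`, and its values are hypothetical, and nothing is executed.

```python
def parameterized_render(source: str, params: dict[str, str]) -> list[str]:
    """Build a `quarto render` command that passes report parameters with -P."""
    command = ["quarto", "render", source]
    for name, value in params.items():
        command += ["-P", f"{name}:{value}"]
    return command


# Hypothetical parameter values: one rendered report per region.
regions = ["north", "south"]
jobs = [parameterized_render("report.qmd", {"region": r}) for r in regions]

for job in jobs:
    print(" ".join(job))
```

The report source stays the same for every run; only the parameter values differ.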

Summary of benefits

  • Use natural language to clearly explain data, models, and workflows
  • Reduce dependencies on outside and undocumented steps
  • Ability to expose technical code chunks depending on audience focus
  • State of the art reproducibility
    • 21st century container for evidence-based, computationally-processed research

Version Control

Definition

  • A system to manage projects (repo)
  • A system to track how computer files change over time
  • A system that supports collaborative revision
  • More than file synchronization
  • Assists in project back-ups

Git

  • Free and open source
  • Wildly successful; most broadly implemented
  • In use across the globe
  • Use it on any file system
  • Track any file
  • Use it in any environment

Scalable to project size

Project Repositories

Archival vs version-control



Zenodo - posterity of milestones

Git - track evolution of workflow (i.e. transparency)

Track changes


Branches

GitHub

  • Profile (store and host) git repos
  • Enable collaboration, either public across the globe or private
  • Editorial and fine-grain control

Git + GitHub

Hubs

  • GitHub
  • GitLab
  • Bitbucket

Duke specific hubs

  • gitlab.oit.duke.edu (netId)
  • PACE
  • Anywhere that data and coding happens.

File Distribution and Collaboration

https://youtu.be/ThC3bSs-iZA?si=pC7vCy06CDGxzpuz&t=90

Push

Other project management features

https://youtu.be/ThC3bSs-iZA?si=ShxY4bylJgkxt-zK&t=100

Basic features

Git features implemented for distribution

  • Push
  • Public or Private
  • Clone / Fork
  • Pull Request
  • Pull

Clone

https://youtu.be/ThC3bSs-iZA?si=O18MREhlOpUiwJ6w&t=143

Fork / PR

https://youtu.be/ThC3bSs-iZA?si=kzj82KArR4WKyT-1&t=180

Summary

  • Git is used to track changes to your repo
  • GitHub is used to distribute your git repo and facilitate collaboration

Containers

Sharing your workspace

Ever been tempted to give someone else your laptop so they can play around with your projects, the code, the data, the settings and configurations?

Now you can share a copy of your computational environment

How

  • Binder: package and share reproducible computational environments
    • mybinder.org (public BinderHub portal)
  • Zenodo: general, open repository to deposit research papers, data sets, code, reports and related artifacts and connect to a citable DOI.
  • Combine GitHub releases with Zenodo to archive your milestones and share the interactive computation in a BinderHub

Binder Hub

  • Easiest: mybinder.org open and public
    • quarto use binder
  • Security demands may push you to use Singularity

Steps

  1. Make a GitHub Release at project milestone(s)
  2. Connect GitHub to Zenodo
    1. Mint a DOI to a GitHub Release (persistent identifier: citation; milestones)
    2. With DOI, link to ORCID
  3. Create a publicly launchable, fully functional computation container of your work
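The end products of these steps are two links: a DOI link for citation and a launch link for the container. The sketch below builds both, assuming the public mybinder.org launch pattern for GitHub repositories. The DOI shown is a placeholder, not a real record, and whether a given repository is configured for Binder is not checked here.

```python
from urllib.parse import quote


def binder_url(owner: str, repo: str, ref: str = "HEAD") -> str:
    """Build a mybinder.org launch link for a GitHub repo at a branch, tag, or commit."""
    return (
        "https://mybinder.org/v2/gh/"
        f"{quote(owner, safe='')}/{quote(repo, safe='')}/{quote(ref, safe='')}"
    )


def doi_url(doi: str) -> str:
    """Build the resolver link for a DOI, such as one minted by Zenodo."""
    return f"https://doi.org/{doi}"


# Repository taken from the Examples slide; the DOI is a hypothetical placeholder.
print(binder_url("libjohn", "workshop_rfun_iterate"))
print(doi_url("10.5281/zenodo.0000000"))
```

Pointing `ref` at a release tag instead of `HEAD` ties the launchable environment to the same milestone that the DOI archives.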

Examples

  • https://github.com/libjohn/workshop_rfun_iterate?tab=readme-ov-file#readme
  • https://github.com/libjohn/workshop_webscraping?tab=readme-ov-file#readme